What Cannot be Learned with Bethe Approximations
We address the problem of learning the parameters in graphical models when
inference is intractable. A common strategy in this case is to replace the
partition function with its Bethe approximation. We show that there exists a
regime of empirical marginals where such Bethe learning will fail. By failure
we mean that the empirical marginals cannot be recovered from the approximated
maximum likelihood parameters (i.e., moment matching is not achieved). We
provide several conditions on empirical marginals that yield outer and inner
bounds on the set of Bethe learnable marginals. An interesting implication of
our results is that there exists a large class of marginals that cannot be
obtained as stable fixed points of belief propagation. Taken together, our
results provide a novel approach to analyzing learning with Bethe
approximations and highlight when it can be expected to work or fail.
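As a compact schematic of this setup (the symbols are ours, chosen for illustration): writing \hat{\mu} for the empirical marginals, Z_B for the Bethe approximation of the partition function, and \mu_B(\theta) for the pseudo-marginals produced by belief propagation at parameters \theta, Bethe surrogate likelihood learning and the moment-matching criterion read

```latex
\hat{\theta} \;\in\; \arg\max_{\theta}\; \langle \theta, \hat{\mu} \rangle - \log Z_B(\theta),
\qquad \text{moment matching:}\quad \mu_B(\hat{\theta}) = \hat{\mu}.
```

Failure, in the sense used above, means the second condition cannot be satisfied for the given \hat{\mu}.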
Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs
Deep learning models are often successfully trained using gradient descent,
despite the worst case hardness of the underlying non-convex optimization
problem. The key question, then, is under what conditions one can prove that
optimization will succeed. Here we provide a strong result of this kind. We
consider a neural net with one hidden layer and a convolutional structure with
no overlap and a ReLU activation function. For this architecture we show that
learning is NP-complete in the general case, but that when the input
distribution is Gaussian, gradient descent converges to the global optimum in
polynomial time. To the best of our knowledge, this is the first global
optimality guarantee of gradient descent on a convolutional neural network with
ReLU activations.
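As a rough numerical illustration of the setting (our construction; the patch sizes, learning rate, and squared loss are assumptions, not the paper's code): a single shared filter applied to disjoint patches, ReLU, average pooling, plain gradient descent on Gaussian inputs in the realizable case.

```python
# One hidden layer, one non-overlapping convolutional filter w, ReLU, average
# pooling; gradient descent on Gaussian inputs (unlucky initializations may
# require a restart, per the paper's analysis).
import numpy as np

rng = np.random.default_rng(0)
k, d, n = 4, 5, 2000                 # k disjoint patches of size d; n samples
w_true = rng.normal(size=d)          # ground-truth filter (realizable setting)

X = rng.normal(size=(n, k, d))               # Gaussian non-overlapping patches
y = np.maximum(X @ w_true, 0).mean(axis=1)   # f(x) = mean_j ReLU(w . x_j)

w = 0.1 * rng.normal(size=d)
lr = 0.1
for _ in range(1000):
    pre = X @ w                              # (n, k) pre-activations
    resid = np.maximum(pre, 0).mean(axis=1) - y
    # gradient of 0.5 * mean_i (pred_i - y_i)^2 with respect to w
    grad = np.einsum('nk,nkd->d', resid[:, None] * (pre > 0) / k, X) / n
    w -= lr * grad

print("filter recovery error:", np.linalg.norm(w - w_true))
```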
Semi-Supervised Learning with Competitive Infection Models
The goal in semi-supervised learning is to effectively combine labeled and
unlabeled data. One way to do this is by encouraging smoothness across edges in
a graph whose nodes correspond to input examples. In many graph-based methods,
labels can be thought of as propagating over the graph, where the underlying
propagation mechanism is based on random walks or on averaging dynamics. While
theoretically elegant, these dynamics suffer from several drawbacks which can
hurt predictive performance.
Our goal in this work is to explore alternative mechanisms for propagating
labels. In particular, we propose a method based on dynamic infection
processes, where unlabeled nodes can be "infected" with the label of their
already infected neighbors. Our algorithm is efficient and scalable, and an
analysis of the underlying optimization objective reveals a surprising relation
to other Laplacian approaches. We conclude with a thorough set of experiments
across multiple benchmarks and various learning settings.
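A toy sketch of one such infection dynamic (our simplification, not the paper's algorithm): labeled seeds race along weighted edges, and each unlabeled node adopts the label of the first infection to reach it.

```python
# Competitive infection as a multi-source Dijkstra: edge weight acts as a
# transmission delay, and the earliest arrival wins each node.
import heapq

def infect(adj, seeds):
    """adj: {node: [(neighbor, weight), ...]}; seeds: {node: label}."""
    labels = {}
    heap = [(0.0, node, label) for node, label in seeds.items()]
    heapq.heapify(heap)
    while heap:
        t, u, label = heapq.heappop(heap)
        if u in labels:              # already infected earlier
            continue
        labels[u] = label
        for v, w in adj.get(u, []):
            if v not in labels:
                heapq.heappush(heap, (t + w, v, label))
    return labels

graph = {0: [(1, 1.0)], 1: [(0, 1.0), (2, 1.0)],
         2: [(1, 1.0), (3, 0.5)], 3: [(2, 0.5)]}
print(infect(graph, {0: "+", 3: "-"}))   # {0: '+', 3: '-', 2: '-', 1: '+'}
```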
Gaussian Robust Classification
Supervised learning is all about the ability to generalize knowledge.
Specifically, the goal of learning is to train a classifier on training data
in such a way that it correctly classifies new, unseen data. To achieve this
goal, it is important to design the learner carefully so that it does not
overfit the training data. The latter is usually done by adding a
regularization term. Statistical learning theory explains the success of this
method by claiming that it restricts the complexity of the learned model. This
explanation, however, is rather abstract and lacks geometric intuition. The
generalization error of a
classifier may be thought of as correlated with its robustness to perturbations
of the data: a classifier that withstands such perturbations is expected to
generalize well. Indeed, Xu et al. [2009] have shown that the SVM formulation is
equivalent to a robust optimization (RO) formulation, in which an adversary
displaces the training and testing points within a ball of pre-determined
radius. In this work we explore a different kind of robustness, namely
replacing each data point with a Gaussian cloud centered at the sample. The
loss is evaluated as the expectation of an underlying loss function over the
cloud. This setup reflects the fact that in many applications the data is
sampled along with noise. We develop an RO framework in which the adversary
chooses the covariance of the noise. In our algorithm, named GURU, the tuning
parameter is a spectral bound on the noise, so it can be estimated using
physical or application-specific considerations. Our experiments show that
this framework performs as well as
SVM and even slightly better in some cases. Generalizations for Mercer kernels
and for the multiclass case are presented as well. We also show that our
framework may be further generalized, using the technique of convex perspective
functions.
Comment: Master's dissertation of the first author, carried out under the supervision of the second author.
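For intuition, here is a minimal sketch of the loss construction (ours, not the GURU algorithm itself; GURU lets the adversary choose the covariance subject to a spectral bound, whereas this sketch fixes isotropic noise of scale sigma).

```python
# Monte Carlo estimate of E[max(0, 1 - y * w.(x + eps))], eps ~ N(0, sigma^2 I):
# the hinge loss averaged over a Gaussian cloud around each sample.
import numpy as np

rng = np.random.default_rng(1)

def gaussian_cloud_hinge(w, X, y, sigma, n_samples=100):
    total = 0.0
    for _ in range(n_samples):
        noisy = X + sigma * rng.normal(size=X.shape)
        total += np.maximum(0.0, 1.0 - y * (noisy @ w)).mean()
    return total / n_samples

X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] > 0, 1.0, -1.0)        # labels from the first coordinate
w = np.array([2.0, 0.0, 0.0])
for sigma in (0.0, 0.5, 1.0):               # larger clouds -> larger robust loss
    print(sigma, gaussian_cloud_hinge(w, X, y, sigma))
```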
Optimal Tagging with Markov Chain Optimization
Many information systems use tags and keywords to describe and annotate
content. These allow for efficient organization and categorization of items, as
well as facilitate relevant search queries. As such, the selected set of tags
for an item can have a considerable effect on the volume of traffic that
eventually reaches an item. In settings where tags are chosen by an item's
creator, who in turn is interested in maximizing traffic, a principled approach
for choosing tags can prove valuable. In this paper we introduce the problem of
optimal tagging, where the task is to choose a subset of tags for a new item
such that the probability of a browsing user reaching that item is maximized.
We formulate the problem by modeling traffic using a Markov chain, and asking
how transitions in this chain should be modified to maximize traffic into a
certain state of interest. The resulting optimization problem involves
maximizing a certain function over subsets, under a cardinality constraint. We
show that the optimization problem is NP-hard, but nonetheless admits a
(1-1/e)-approximation via a simple greedy algorithm. Furthermore, the structure
of the problem allows for an efficient implementation of the greedy step. To
demonstrate the effectiveness of our method, we perform experiments on three
tagging datasets, and show that the greedy algorithm outperforms other
baselines.
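The greedy step itself is generic; a sketch (ours), with a toy monotone coverage function standing in for the paper's traffic objective, i.e., the probability of a browsing user reaching the item in the modified Markov chain:

```python
# Greedy subset selection under a cardinality budget k: repeatedly add the tag
# with the largest marginal gain in f(S).
def greedy_tags(candidates, k, f):
    chosen = set()
    for _ in range(k):
        best = max((t for t in candidates if t not in chosen),
                   key=lambda t: f(chosen | {t}), default=None)
        if best is None:
            break
        chosen.add(best)
    return chosen

# placeholder objective: how many users the chosen tags "reach"
def coverage(S, reach={"a": {1, 2}, "b": {2, 3}, "c": {4}}):
    return len(set().union(*(reach[t] for t in S))) if S else 0

print(greedy_tags(["a", "b", "c"], 2, coverage))   # {'a', 'b'} (ties -> first)
```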
Robust Conditional Probabilities
Conditional probabilities are a core concept in machine learning. For
example, optimal prediction of a label y given an input x corresponds to
maximizing the conditional probability of y given x. A common approach to
inference tasks is learning a model of conditional probabilities. However,
these models are often based on strong assumptions (e.g., log-linear models),
and hence their estimate of conditional probabilities is not robust and is
highly dependent on the validity of their assumptions.
Here we propose a framework for reasoning about conditional probabilities
without assuming anything about the underlying distributions, except knowledge
of their second order marginals, which can be estimated from data. We show how
this setting leads to guaranteed bounds on conditional probabilities, which can
be calculated efficiently in a variety of settings, including
structured prediction. Finally, we apply them to semi-supervised deep learning,
obtaining results competitive with variational autoencoders.
Comment: 24 pages, 1 figure.
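To see how such bounds can be computed, here is a toy sketch (our construction, with made-up marginals, not the paper's method): bound p(y=1 | x1=1, x2=1) for three binary variables by linear programming over the joint, given only first- and second-order marginals. Since p(x1=1, x2=1) is itself a pairwise marginal, dividing the extremal joint mass by it bounds the conditional.

```python
# Enumerate the 8 joint atoms over (x1, x2, y), constrain them to match the
# given pairwise marginals, then min/max p(x1=1, x2=1, y=1) with an LP.
import itertools
import numpy as np
from scipy.optimize import linprog

atoms = list(itertools.product([0, 1], repeat=3))     # (x1, x2, y)

def indicator(pred):
    return np.array([float(pred(a)) for a in atoms])

constraints = [
    (indicator(lambda a: True), 1.0),                  # normalization
    (indicator(lambda a: a[0] == 1), 0.5),             # p(x1=1)
    (indicator(lambda a: a[1] == 1), 0.5),             # p(x2=1)
    (indicator(lambda a: a[2] == 1), 0.375),           # p(y=1)
    (indicator(lambda a: a[0] == a[1] == 1), 0.25),    # p(x1=1, x2=1)
    (indicator(lambda a: a[0] == a[2] == 1), 0.275),   # p(x1=1, y=1)
    (indicator(lambda a: a[1] == a[2] == 1), 0.275),   # p(x2=1, y=1)
]
A_eq = np.array([row for row, _ in constraints])
b_eq = [val for _, val in constraints]
c = indicator(lambda a: a == (1, 1, 1))                # p(x1=1, x2=1, y=1)

lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * len(atoms))
hi = linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * len(atoms))
print(lo.fun / 0.25, -hi.fun / 0.25)   # bounds on p(y=1 | x1=1, x2=1): 0.7, 1.0
```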
Learning Infinite-Layer Networks: Without the Kernel Trick
Infinite-Layer Networks (ILN) have recently been proposed as an architecture
that mimics neural networks while enjoying some of the advantages of kernel
methods. ILN are networks that integrate over infinitely many nodes within a
single hidden layer. It has been demonstrated by several authors that the
problem of learning ILN can be reduced to the kernel trick, implying that
whenever a certain integral can be computed analytically they are efficiently
learnable.
In this work we give an online algorithm for ILN, which avoids the kernel
trick assumption. More generally and of independent interest, we show that
kernel methods in general can be exploited even when the kernel cannot be
efficiently computed but can only be estimated via sampling.
We provide a regret analysis for our algorithm, showing that it matches the
sample complexity of methods which have access to kernel values. Thus, our
method is the first to demonstrate that the kernel trick is not necessary as
such, and that random features suffice to obtain comparable performance.
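For context, a brief sketch (ours) of the random-features estimate alluded to above, using random Fourier features for the RBF kernel (Rahimi and Recht, 2007): since exp(-||x-z||^2/2) = E_w[cos(w.(x-z))] for w ~ N(0, I), sampled cosine features give an estimate of kernel values without computing the kernel.

```python
# Random Fourier features: phi(x).phi(z) approximates the RBF kernel k(x, z).
import numpy as np

rng = np.random.default_rng(0)

def random_fourier_features(X, n_features=500):
    d = X.shape[1]
    W = rng.normal(size=(d, n_features))            # w ~ N(0, I) for unit RBF
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

x = rng.normal(size=(1, 4))
z = rng.normal(size=(1, 4))
phi = random_fourier_features(np.vstack([x, z]))    # shared features for x, z
print("estimate:", phi[0] @ phi[1])
print("exact:   ", np.exp(-0.5 * np.linalg.norm(x - z) ** 2))
```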
Sufficient Dimensionality Reduction with Irrelevant Statistics
The problem of finding a reduced dimensionality representation of categorical
variables while preserving their most relevant characteristics is fundamental
for the analysis of complex data. Specifically, given a co-occurrence matrix of
two variables, one often seeks a compact representation of one variable which
preserves information about the other variable. We have recently introduced
"Sufficient Dimensionality Reduction" [GT-2003], a method that extracts
continuous reduced dimensional features whose measurements (i.e., expectation
values) capture maximal mutual information among the variables. However, such
measurements often capture information that is irrelevant for a given task.
Widely known examples are illumination conditions, which are irrelevant as
features for face recognition, writing style which is irrelevant as a feature
for content classification, and intonation which is irrelevant as a feature for
speech recognition. Such irrelevance cannot be deduced a priori, since it
depends on the details of the task, and is thus inherently ill-defined in the
purely unsupervised case. Separating relevant from irrelevant features can be
achieved using additional side data that contains such irrelevant structures.
This approach was taken in [CT-2002], extending the information bottleneck
method, which uses clustering to compress the data. Here we use this
side-information framework to identify features whose measurements are
maximally informative for the original data set, but carry as little
information as possible on a side data set. In statistical terms this can be
understood as extracting statistics which are maximally sufficient for the
original dataset, while simultaneously maximally ancillary for the side
dataset. We formulate this tradeoff as a constrained optimization problem and
characterize its solutions. We then derive a gradient descent algorithm for
this problem, which is based on the Generalized Iterative Scaling method for
finding maximum entropy distributions. The method is demonstrated on synthetic
data, as well as on real face recognition datasets, and is shown to outperform
standard methods such as oriented PCA.
Comment: Appears in Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI2003).
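In schematic form (the notation is ours and deliberately loose: Y+ is the main variable, Y- the side variable, I(.;.) the information captured by the feature measurements, and lambda >= 0 the tradeoff parameter), the objective described above is

```latex
\max_{\phi}\;\; \mathcal{I}\big(\phi(X);\, Y^{+}\big) \;-\; \lambda\, \mathcal{I}\big(\phi(X);\, Y^{-}\big),
```

so the features are pushed to be maximally sufficient for the main data while maximally ancillary for the side data.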
Tight Error Bounds for Structured Prediction
Structured prediction tasks in machine learning involve the simultaneous
prediction of multiple labels. This is typically done by maximizing a score
function on the space of labels, which decomposes as a sum of pairwise
elements, each depending on two specific labels. Intuitively, the more pairwise
terms are used, the better the expected accuracy. However, there is currently
no theoretical account of this intuition. This paper takes a significant step
in this direction.
We formulate the problem as classifying the vertices of a known graph
G, where the vertices and edges of the graph are labelled and correlate
semi-randomly with the ground truth. We show that the prospects for achieving
low expected Hamming error depend on the structure of the graph in
interesting ways. For example, if G is a very poor expander, like a path,
then large expected Hamming error is inevitable. Our main positive result shows
that, for a wide class of graphs including 2D grid graphs common in machine
vision applications, there is a polynomial-time algorithm with small and
information-theoretically near-optimal expected error. Our results provide a
first step toward a theoretical justification for the empirical success of the
efficient approximate inference algorithms that are used for structured
prediction in models where exact inference is intractable.
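For concreteness (notation ours), the normalized expected Hamming error of a predictor \hat{y} against the ground truth y over the vertex set V is

```latex
\mathrm{err}(\hat{y}) \;=\; \frac{1}{|V|}\; \mathbb{E}\left[\, \sum_{v \in V} \mathbf{1}\{\hat{y}_v \neq y_v\} \,\right].
```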
Learning Rules-First Classifiers
Complex classifiers may exhibit "embarrassing" failures in cases where humans
can easily provide a justified classification. Avoiding such failures is
obviously of key importance. In this work, we focus on one such setting, where
a label is perfectly predictable if the input contains certain features, or
rules, and otherwise it is predictable by a linear classifier. We define a
hypothesis class that captures this notion and determine its sample complexity.
We also give evidence that efficient algorithms cannot achieve this sample
complexity. We then derive a simple and efficient algorithm and show that its
sample complexity is close to optimal, among efficient algorithms. Experiments
on synthetic and sentiment analysis data demonstrate the efficacy of the
method, both in terms of accuracy and interpretability.
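A minimal sketch of the hypothesis class (ours; the rule set and weights are illustrative, not learned by the paper's algorithm): if any rule feature fires, its label wins outright; otherwise a linear classifier decides.

```python
# Rules-first prediction: perfect rules take precedence over the linear model.
import numpy as np

class RulesFirstClassifier:
    def __init__(self, rules, w, b=0.0):
        self.rules = rules                  # {feature_index: label}
        self.w, self.b = np.asarray(w), b

    def predict(self, x):
        for idx, label in self.rules.items():
            if x[idx] != 0:                 # rule feature present in the input
                return label
        return int(np.sign(self.w @ x + self.b))   # fall back to linear

clf = RulesFirstClassifier(rules={0: +1, 1: -1}, w=[0.0, 0.0, 1.5, -1.0])
print(clf.predict(np.array([1, 0, 0.2, 0.9])))   # rule 0 fires: +1
print(clf.predict(np.array([0, 0, 0.2, 0.9])))   # linear: sign(0.3 - 0.9) = -1
```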